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Abstract —From a machine learning perspective, the human 
ability localize sounds can be modeled as a non-parametric and 
non-linear regression problem between binaural spectral features 
of sound received at the ears (input) and their sound-source direc¬ 
tions (output). The input features can be summarized in terms of 
the individual’s head-related transfer functions (HRTFs) which 
measure the spectral response between the listener’s eardrum 
and an external point in 3D. Based on these viewpoints, two 
related problems are considered: how can one achieve an optimal 
sampling of measurements for training sound-source localization 
(SSL) models, and how can SSL models be used to infer the 
subject’s HRTFs in listening tests. First, we develop a class of 
binaural SSL models based on Gaussian process regression and 
solve a forward selection problem that finds a subset of input- 
output samples that best generalize to all SSL directions. Second, 
we use an active-learning approach that updates an online SSL 
model for inferring the subject’s SSL errors via headphones and 
a graphical user interface. Experiments show that only a small 
fraction of HRTFs are required for 5° localization accuracy and 
that the learned HRTFs are localized closer to their intended 
directions than non-individualized HRTFs. 

Index Terms —Head-related transfer function, Gaussian pro¬ 
cess regression, sound-source localization, active-learning 


1. Introduction 

Many animals possess a remarkable omnidirectional sound 
localization ability enabled by subconsciously processing sub¬ 
tle features in the sounds received at the two ears from a 
common source location. For humans, these features arise due 
to the incoming acoustic wave scattering off the listener’s 
anatomical features (head, torso, pinnae) before reaching the 
eardrum. The spectral ratio between the sounds recorded at the 
eardrum and that would have been obtained at the center of the 
head in absence of the listener is known as the head-related 
transfer function (HRTF) Q; HRTFs are thus specific to the 
individual’s anthropometry, wave direction, and contain other 
important cues such as the interaural time delay (ITD) and 
the interaural level difference (ILD) ||2l. Moreover, knowledge 
of individualized HRTFs allow for perceptually accurate 3D 
spatial audio synthesis 11, a, a. 

We investigate the pre-image problem, namely how pairs of 
left and right ear HRTFs and functions of HRTFs (features 
based on them) map back to their measurement directions. 
This is related to the problem of sound-source localization 
(SSL) where under simple (anechoic) conditions, the direction 
of an acoustic event can be inferred from multi-receiver 
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recordings of the sound spectrum by expressing the spec¬ 
tral cues solely in terms of the receiver’s transfer functions 
(independent of their actual content). This is of interest in 
robot perception (e.g. for event detection and localization ||6l, 
CD, where the receiver’s transfer functions can be measured 
beforehand. For humans, this problem is restricted to two 
receivers (human ears) where functions of left and right pairs 
of HRTFs are mapped to their measurement directions in place 
of SSL directions. Thus, it possible to model this relation as 
either a classification or a regression problem between the 
two domains. Many works in literature have attempted similar 
tasks. 


A. Prior Works 

Cue-mapping ^ uses ITD, ILD, and interaural enve¬ 
lope difference features paired with azimuth directions in a 
weighted kernel nearest-neighbor (NN) setting. A linear map¬ 
ping between ITD, ILD, and HRTF notch frequency features 
to spherical coordinates can be learned ii. A self-organizing 
map between input ITD, spectral notches features and output 
horizontal and median plane coordinates can be trained El. 
Conditional probability maps derived from per-frequency ITD 
and ILD can be used to estimate direction via a maximum 
a posteriori estimator (H. A probabilistic affine regression 
model between interaural transfer functions and the direction 
is possible ED. 

Most closely related to our work are the source-cancellation 
and match-filtering algorithms 021, (El, C3, Ga, where 
the binaural recordings (Sl left, Sr right ears) are represented 
as convolutions of a common sound-source signal S and the 
appropriate filters; for recording done in an anechoic space, 
these filters are the same-direction HRTFs (Hr left, Hr right 
ears). The per-frequency domain representation is given by 

Sl = HloS, Sr=HroS, (1) 

where o is element-wise product. The source-signal S is 
removed by computing the ratio between left and right channel 
recordings (^ = ^). These binaural features, which are 
reduced to ratios of HRTFs, can be compared to those pre¬ 
computed from the subject’s collection of measured HRTFs; 
the measurement direction belonging to the maximally cross- 
correlated pair is reported as the sound-source direction. 
Such an approach can be interpreted as a nearest neighbor 
(NN) classifier where the binaural features and measurement 
directions are single class instances and labels respectively. 
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B. Present Work 


We propose a generalization of the match-filtering algorithm 
that addresses several deficiencies: While an NN classifier 
is accurate for a large number of training samples, it does 
not report out-of-sample spatial directions unless specified in 
a regression context. Linear r^ression methods via ordinary 
least squares (OLS) regressor^ often perform poorly due to 
inaccurate assumptions on the model complexity (number of 
parameters) and the linearity between predictors and outputs. 
Common issues include over-fitting the model to noise that 
arise from parametric OLS methods and under-fitting the 
training data from assumptions of linearity. Instead, we adopt 
a non-linear and non-parametric0 Gaussian process (GP) re¬ 
gression (GPR) csi framework to address these issues. 

GPR is a kernel metho^ that places weak assumptions on 
the joint probability distributior|^of latent function realizations 
that would model the output observations (spatial directions) 
in a Bayesian setting. Observations are drawn (realized) from 
a high-dimensional normal distribution that represents the 
joint probability density function of a collection of random 
variables indexed by their predictor variables. GPs have several 
attractive properties that are well-suited for SSL. 

Based on the observation that HRTFs corresponding to 
different spatial directions covary smoothly with the consid¬ 


ered binaural features (see sections |I^, we show they can 
be modeled via simple stationary GP covariance functions 
(see section |IV]). The GP Bayesian formulation allows for 
the choice of the covariance function, which governs the 
smoothness between realizations at nearby predictors, to be 
automatically selected by evaluating a data marginal-likelihood 
criterion (goodness-of-fit); covariance functions belong to a 
function class and are specified by their “hyperparameters” 
(parameters that describe distributions). This allows the co- 
variance function hyperparameters to be learned without the 
need for cross-validation and provides insights as to the 
intrinsic dimensionality of the high-dimensional feature space 
that the binaural features are mapped to. Most importantly, 
uncertainties in GP prediction are well-defined in terms of 
both prior and posterior distributions; the predicted variances 
at different inputs are tractable. Thus, GPR generalizes NN 
classifiers as it makes non-linear inferences to observations 
outside the training set. By the representer theorem, kernel 
methods such as support vector regression (SVR) Cll and 
GPR make predictions expressible as linear combinations of 
non-linear covariance evaluations between the training fea¬ 
tures/observations and the test features. 

In general, GPs perform better (make accurate inferences) 
with more observations (data) than other non-linear regres¬ 
sion methods that do not encode and select for prior data- 
assumptions. The trade-off is its high computational costs 


= x^j8, j8 = ^for parameters j8 

^Number of parameters is proportional to the number of data samples 
conditioned upon for inference. 

^Predictor variables are implicitly mapped to a reproducing kernel Hilbert 
space whose inner products are taken to be evaluations of a valid Mercer 
kernel or covariance function. 

^Normal distribution defined by prior mean and covariance functions of 
predictor variables (binaural features). 


(O (A^^) operations for N number of observations) for both 
model-selection and inference; scaling GPs for for large 
datasets is an active field of research. Fortunately, the avail¬ 
ability of high quality datasets, computational resources, and 
faster algorithmic formulations have allowed us to overcome 
these problems. In previous works, we have used several 
properties of HRTF datasets to to perform fast GP based HRTF 
interpolation ca and data-fusion ca. The current work is a 
major extension of our recent work on binaural SSL 1201 . For 
future references, we refer to GPs that predict SSL directions 
as GP-SSL models (see section llv] for a complete derivation). 


11. Formulation of Problems 

This work investigates two problems related to GP-SSL 
models (see Fig. ). For notation, we refer to a binaural 
feature as a D-dimensional vector x G (D is number 
of frequency bins), the measurement direction as the unit 
vector y G (M = 3 for the standard Cartesian basis), 
and collections of the aforementioned quantities {N number 
of samples) as concatenated into matrices X G and 

Y G The binaural features are independent of the sound- 

source content and thus strictly functions of the subject’s 
HRTFs (see section |IIJ. GP-SSL models are thereby specified 
and trained over known HRTFs and measurement directions 
belonging to CIPIC EB database subjects. 



Fig. 1: Gaussian Process Regression with binaural features (bottom 
two boxes) to perform two types of inferences. On the left are shown 
the steps needed to perform sound-source localization. On the right is 
shown an active-learning framework that combines SSL with listening 
tests to learn a listener’s HRTFs. 
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A. Feature sub sets election 

Subset-selection for non-parametric methods such as NN 
and GPR is an important technique for reducing the model- 
order complexity and run-time costs for inference. SSL models 
that are trained with randomized subsets of samples trade 
measurement and prediction costs for localization accuracy. 
Increasing the density of measurement samples over the spher¬ 
ical grid results in a linear increase to both NN classification 
computational cost and accuracy, a quadratic and cubic in¬ 
crease to respective GP inference and training computational 
costs, and a non-linear increase to GP localization accuracy. 
We show how GP-SSL models using small and non-uniform 
subset-selected samples (which are most informative) make 
more accurate predictions over the full spherical grid than 
models evidenced on a randomized subset. 

A simple greedy forward-selection (GFS) algorithm f22\ 
that sequentially incorporates training samples into a subset 
without considerations in future iterations is implemented. It 
ranks all training samples outside the subset via a user-defined 
objective function (risk function) and adds the minimizer 
into the subset. We propose a class of risk functions that 
generalizes the GP prediction errors and show that the subset- 
selected GP-SSL models localize directions more accurately 
than models evidenced on randomized inputs (see section 0; 
only a small fraction of training samples are required for 
reasonable accuracy (5°). 

B. Active-learning for individualizing HRTFs 

Individualized HRTFs are needed for synthesizing accurate 
spatial audio that resolve front-back and up-down directional 
confusion m, o, Qi. Due to the difficulties of directly mea¬ 
suring HRTFs ll23i . a number of works have sought indirect 
means for learning the subject’s HRTFs: regression models 
between the individual’s physically measured anthropometry 
and his/her HRTFs can be learned via neural-network El 
and multiple non-linear regression models ll^ but do not 
generalize well to test subjects. HRTFs can also be learned 
through listening tests 1^ . Ea by having an individual listen 
to a query HRTF x convolved with white Gaussian noise 
(WGN) (heard over a pair of headphones), localize the test 
signal (report a direction v G M?), and then hand-tune the 
spectra of x or choose a new x out of a large candidate pool 
over a graphical user interface (GUI) as to move subsequent 
localizations towards a target direction u G The hand¬ 
tuning/selection step can be replaced by developing a recom¬ 
mendation system that selects for the query HRTF between 
rounds (steps) of localization. The listener can rank candidate 
HRTFs chosen from a genetic algorithn0 EH- HRTFs can 
also be tuned along a low-dimensional autoencoder space 1^ 
where u is unknown to the listener. 

We propose to formulate the recommendation problem in an 
active-learning Eol context described as follows: given a finite 
set of candidate HRTFs sampled from a prior distribution 
(database or generative model), determine the HRTF from the 
that the listener would localize nearest to u within T rounds 

^Evaluates a fitness function w.r.t. localization accuracy of known u 
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of localizations. During round t <T, the recommender selects 
a query x that the listener labels as v^(x) without knowledge 
of u. The choice of x is referred to as the query-selection 
problem of minimizing the SSL error (SSLE) (modified cosine 
distance) given by 

SSLE(u,v^(x)) = -u^v^(x), arg min SSLE (u, v^(x)). (2) 

Unfortunately, the minimizer in Eq.j^is unlikely to be found 
within T rounds as X^ can be large and T must also be small 
as the cost of evaluating SSLE by the listener is high. It is 
more reasonable to model the SSLE function using an online 
regression model (adapting HRTEs predictors of SSLEs after 
each round) and select for x based on two competing strate¬ 
gies: query-selection exploits the online model by choosing x 
that the model predicts will have low SSLE and explores x that 
has high model uncertainty in its prediction; both concepts 
are trade-offs that require probabilistic treatments of model 
predictions. Eortunately, GPs are well-suited to this task as all 
predictions are expressed as probabilistic realizations sampled 
from normal distributions. Thus, we propose to solve the 
modeling problem via GP-SSLE^ and the query-selection 
problem using a method of GPs for the global optimization 
of smooth functions ED, |[32l (see section lYllf. The relation 
between these methods and the GP-SSL models is also shown. 

HI. Binaural Sound-Source Invariant Eeatures 

We consider several sound-source invariant features that can 
be extracted from short-time Eourier transforms of the left 
and right ear input channel streams in Eq. (see Table and 
Fig. 13; it is useful to express the discrete Eourier transformed 
signals by their magnitude and phase representations where 
H{j(o) = The features are expressed as ra¬ 

tios between left and right ear channel recordings that remove 
the effects of the acoustic content in S; the remainder is strictly 
a per-frequency function of same-direction left and right ear 
HRTEs derived as follows: 


Tab. I: HRTF sound-source invariant features X 


1 

+ l) =log(^ + l) 

Log-magnitude ratio 

N 

II 

K 

1 

K 

Phase difference 

0.5(|S 

VI _ 2\Hl\ 

d+lVl) \Hl\ + \Hr\ 

Avg. magnitude ratio 



Magnitude pairs for flat S 


Log-magnitude ratio (LMR) lHH : While the source- 
cancellation method removes the dependence on signal S, 
the resulting features are complex, noisy, and difficult to 
interpret. This can be avoided by considering the magnitude 
representation which gives the relative per-frequency energy 
between the channel signals. Adding a constant to the ratio 
prior to the log-transform penalizes the magnitude of the 
perturbation; adding a constant 1 constrains the log-transform 
to be non-negative. 

Phase difference (PD): Similarly, the per-frequency phase 
of the complex channel signal ratio can be expressed by the 
phase-difference between left and right HRTFs. For identical 

^GPs that predict the SSLE from HRTFs 
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Sl^Sr that differ by onset time-delays Ar^Ar, the phase- 
difference is simply the constant delay Ar — Ar across all fre¬ 
quencies; this ITD can be related to azimuth angles via Wood¬ 
worth’s model (331. For arbitrary Sr, Sr, the per-frequency 
phase-differences differ and are to be treated as independent 
variables in regression models. 

Average magnitude ratio (AMR): The magnitude source- 
signal \S\ can also be removed by taking the ratio of left 
or right magnitude signals and the binaural average 

{\Sr \ \Sr\)/2. Without the constant factor, the feature can 

be interpreted as the per-frequency contribution of the left 
or right magnitude HRTFs to the additive binaural magnitude 
response. Unlike log-magnitude ratio features that approaches 
a singularity as ^ 0, these features are bounded in the 
interval [0,2) and finite everywhere unless the binaural average 
is zero. 

Magnitude pairs (MP): The magnitude pairs are the con¬ 
catenation of the original left and right magnitude HRTFs that 
could be derived from convolution with a WGN S with zero 
mean and unit variance. The power spectrum of is constant 
across all frequencies and so would be constant 

factors of magnitude HRTFs. Such conditions arise during 
listening tests where the source-signal S can be specified; the 
test features can then be derived from per-frequency division 
given by Hr = Sr/S and Hr = Sr/S. 


Log-magnitude Ratio Feats. 


Phase Difference Feats. 







Fig. 2: Binaural features extracted from CIPIC subject 156 HRTFs 
are shown for horizontal and median plane directions. 


IV. Gaussian Process Regression for SSL 

In a general regression problem, one predicts a scalar target 
variable y from an input vector x of independent variables 
based on a collection of available observations. A common 
Bayesian approach for inference assumes that the observation 
y is generated (realized) from a latent function /(x) given by 

y = f{x) + e, e~yr(0,cr2), (3) 

which is corrupted by additive Gaussian white noise with 
zero mean and constant variance This latent function is 
given the form of a kernel regression /(x) = 0(x)^j3, j3 ^ 
^(0,Zp) where the function 0(x) : ^ maps the 

inputs X into a high-dimensional space before computing the 
inner product with a vector of parameters realized from a 
collection of random variables with a prior multivariate normal 
distribution j3 G . Unlike linear regression, the parameters 
P are not explicitly found in order to perform inference but 
are marginalized in order to compute the first two moments 
(mean and covariance) of function /(x) given by 

E[/(x)]=<^)(x)^E[/3]=0, 

E [/(x)/(xO] = <?>(x)^E (/.(x') = 

The latent function realizations /(x) are thus drawn from 
a multivariate normal distribution with mean /i(x) =0 and 
variance k(x,x') = 0(x)^Zp0(x'). For =/, the inner prod¬ 
uct can be replaced with the covariance function k(x,x') = 
0(x)^0(x') which GPs generalize as follows: 

A GP / is a collection of random variables where any 
finite subset indexed at N inputs X = [xi,... ,Xiv] has the joint 
multivariate normal distribution given by 

[/(xi),... ,/(x;v)] - (5) 

and thus fully defined by the prior mean function /i(x) 
and the prior covariance function k(x,x'). The prior mean 
function and vector /i(X) G are set to zero without loss 
of generality following Eq. The covariance (Gram) matrix 
K{X,X) G is characterized by the pairwise covariance 

function evaluations Kij = k{xi,Xj); the covariance function 
is a positive semi-definite kernel (Mercer’s condition) that 
establishes the existence of the eigenfunction (j) (x). This allows 
kernel methods such as SVR and GPR to omit computing the 
exact mapping (j) as the inner products in the high-dimensional 
space, representing the similarity measure between input fea¬ 
tures x,x', are well-defined. 

GP inference at test inputs X^ G evidenced on 

training inputs X and the observations in F G derives 
from the multivariate normal distribution of random variables 
/* = f{X^) conditioned on f{X) =Y, X. This is given by 

r ' r (6) 

where K = K{X,X)^ G^I adjusts for the observation noise and 
Kf^ = K(X^X^) ^ are pair-wise covariance evaluations 

between training and test inputs. We refer to the distribution 
in Eq. as the posterior GP defined by the posterior mean 
and posterior covariance functions /* and respectively. The 
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former represents the vector of expected outputs (prediction) 
at and the latter is gives the confidence intervals (diagonal 
of the matrix) of the predictions. 

For the GP-SSL model, X and Y G are the respective 
binaural features in Table. [T| and their measurement directions 
(unit vectors where Yi = T / are values along the coordinate); 
test inputs X^ refer to the binaural features extracted from 
test signals. While it is possible to model all M = 3 output 
coordinates as a collection of M independent GPs = 

{/i(X),.. ./m(^)}, a computationally cheaper alternative is to 
specify a common prior mean and covariance function shared 
by all GPs. Specifying a shared covariance model between GPs 
is reasonable as the original HRTFs are originally measured 
over the same physical topology of a human subject from a 
near-uniform spherical grid of directions. Thus for inference, 
we use three independent GPs, with shared priors, to model 
left-right, front-back, and top-down coordinate directions by 
either sampling from their posterior distribution or reporting 
their posterior means. 


A. Choice of Covariance Functions 

The “smoothness/correlatedness” of realizations of f{X) 
for similar X depends on the number of times that the 
covariance function is differentiable w.r.t. the input arguments. 
Consider the Matern class of covariance functions where 
each function has varying orders of differentiation. For D- 
dimensional inputs, we can specify the GP covariance function 
as the product of ^-independent Matern covariance functions 
of identical class. Three common classes and the product 
covariance function are given as 


Ki^{r4)=e-i, 

A 

Ko.{r,£)=e 2 < 2 , 


D 

k=\ 


(7) 


for distance r and hyperparameters a, 4- Covariance functions 
Ky are [vj times differentiable and stationary due to their 
dependence on |xy^ — x^|. Each function contains a length-scale 
or bandwidth hyperparameter 4 that represents a distance in 
the domain Xy^ where outputs /(xy^) remain correlated; larger 
length-scales result in smoother /. 

A general hyperparameter 0 is optimized by maximizing 
the data log-marginal likelihood (LMH) of the observations Y 
given the GP prior distributions; the derivation follows from 
integrating over the realizations f{X) by the product of data 
likelihoods (sampling Y from /(X)-fe and sampling f{X) 
from the GP prior distribution). The LMH term L = log p{Y\X) 
and its partial derivative are both analytic and given by 


L 



tr (Y^k-^Y) 
M 


+ Alog(2;r) j , 


dL 

< 70 ; 


M 

~2 


tr {k~^P) 


tr {Y^k-^PK-^Y) 


M 


( 8 ) 


where P = dk/dO is the matrix of partial derivatives. A larger 
LMH represents a better goodness-of-fit of the data to the GP 


prior mean and covariances assumptions. Moreover, different 
covariance functions with optimized hyperparameters can be 
compared in this respect without resorting to domain-specific 
metrics. 


B. Model-Order and Cost Analysis 

The GP model-order is proportional to the size of the GP 
prior distribution defined by the A-dimensional multivariate 
normal distribution in Eq. {N is the number of training 
samples). The associated costs of both conditioning on the GP 
prior distribution for inference and performing hyperparameter 
training is dominated by the inversion of the Gram matrix 
(O (A^) operations to compute and O (A^) space to store). Lor 
large A, exact GP becomes intractable and most practitioners 
rely on randomized sampling techniques El to reduce the 
costs at the expense of accuracy. Two types of analyses 
for evaluating this trade-off are given: first, empirical cross- 
validation experiments can demonstrate how data sampling 
(randomized and subset-selection) increases localization error. 
Second, the theoretical dimensionality of the feature space 
0(x) in Eq.[^ despite not having been explicitly computed, can 
be estimated from an eigenanalysis of the GP Gram matrix. 
The distribution of eigenvalues (number of dominant ones) 
gives a minimum bound as to the number of input features 
whose mapping will contain most of the variances in the 
feature space. 

To evaluate the dimensionality of 0(x), we refer to the 
method of kernel principal component analysis 1^ of Gram 
matrix K. Its derivation expresses the eigenvectors v (principal 
directions) and eigenvalues A (measure of variance captured 
by v) of the sample covariance matrix C of features 0(x) in 
the high-dimensional space in the form of 

*' = 2v, (9) 

where at = 0(xy)^v are the component scores between the 
feature mapping and the eigenvector. Applying the “kernel” 
trick allows a to be reformulated in terms of the Gram matrix 
^ as a tractable eigendecomposition problem given by 


N N 

;=i y=i 

Kij = (l){xif(l){xj), 


1 N N 

^ yV T T 

j=ii=l 

Ka = XNa, 


( 10 ) 


which finds the eigenvalues A and scores a. Evaluating the 
contributions of the leading A to the total energy E^LiAy 
estimates the number of eigenvectors that are relevant to (j) (x). 


C. Experiments 

GP-SSL models (input binaural features LMR, PD, AMR, 
and MP from Table. |T| belonging to CIPIC subject 156) are 
trained (batch gradient descent of all covariance function 
hyperparameters 4 via Eq.[^ for 50 iterations. Eor a domain- 
metric, we use the angular separation distance between two 
directions u, u’ (predicted and reference directions) given by 


dist(u,u’) = cos 


< u,u’ > 


u u 


u,u’ G 


( 11 ) 


’ll ’ 
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Tab. II: Data LMH for feature/GP covariance types 


Tab. Ill: Mean angular separation errors (degrees) for feature/methods 



LMR 

PD 

AMR 

MP 

/foo 

2.69e+003 

2.37e+003 

3.9e+003 

6.34e+003 

^3/2 

2.23e+003 

L5e+003 

3.88e+003 

6.29e+003 

^1/2 

2.06e+003 

460 

2.24e+003 

4.84e+003 


Goodness-of-fit: GP-SSL models are specified/trained on 
the full set of inputs X. The data LMHs in Table. |n| are 
computed for several covariance functions and feature types. 
The infinitely differentiable squared exponential Koo gives the 
best-fit (highest LMH) across all features (latent functions 
modeling the SSL directions are smooth w.r.t. changes in the 
feature space). This confirms the fact that a finite collection of 
HRTFs approximates a sound-pressure field that is continuous 
in space. The best-fitting binaural features are the MPs (WGN 
sound-source) and AMRs (arbitrary sound-source); the LMH 
gap between the two suggest that GP-SSL models will perform 
more accurately when the recorded magnitude spectra match 
that of the HRTFs. The LMH gap between AMR and LMR 
suggests that relative contribution may be a better indicator 
of SSL than relative intensities. The low LMH of PD models 
suggests that phase may not be useful for SSL over the entire 
spherical coordinate system. 

Eigenanaylsis of K: The eigenvalues of the K are com¬ 
puted for GP-SSL models trained/specified on the full dataset 
(N = 1250). Fig. shows the contribution of the leading 
eigenvalues to the total energy; Koo specified by the four earlier 
features (LMR, PD, AMR, and MP) require respectively 150, 
30, 100, and 15 leading eigenvectors to capture 90% of the 
total variance. The results suggest that feature mappings for 
MPs and PDs can be approximated with only a few samples 
while LMR and AMR feature mappings are more complex. 



Fig. 3: Cumulative energy of leading eigenvalues for K are shown 
for GP-SSL models (varying covariance functions and feature types). 


Cross-validation: GP-SSL models are trained on a random¬ 
ized third of the available feature-direction pairs (A = 417 
out of 1250); inference follows Eq. at all available in¬ 
puts (X^ = X) where only the posterior mean directions are 
reported. Table |nl| shows the mean angular separation (Eq. 

between predicted and reference directions for GP-SSL, 
NN classifier, OLS methods trained on the same data. Non- 
parametric methods (NN and GPR) outperform parametric 
methods (OLS) across all feature types. The MP and AMR 
features give the lowest errors across all methods (for a visual. 



LMR 

PD 

AMR 

MP 

OLS 

29 

27 

22 

5.4 

NN 

9.2 

20 

7.9 

3.9 

GP-SSL K^,2 

12 

12 

7 

1.8 

GP-SSL K 2/2 

7.5 

11 

4.8 

1.4 

GP-SSL ^00 

6.3 

6.3 

4.8 

1.3 


see the first column of Fig. |^. OLS log-ratios perform the 
worse and suggest that the features are oversensitive linear 
predictors of change in localization direction. PD features, 
while useful for predictions on the horizontal plane, are 
insufficient for localizations over the full sphere. 


V. Feature Subset-Selection 

Greedy feature selection is an efficient method for finding 
a subset of inputs Xr^X that best approximates a functional 
f{Xr) ~ f{X) according to a user-specified risk function R{Xr) 
(measure of distance between f{Xr) and f{X)). Determining 
the optimal subset via an combinatorial exhaustive search is 
prohibitive w.r.t. the number of evaluations of R. A greedy 
heuristic (ranking Xf^^y according to a point-inclusion in the 
risk evaluation R(Afur) and adding the minimizer into the sub¬ 
set Xy without consideration in future iterations) reduces the 
search to a quadratic number of evaluations (see Algorithm [T]). 
For GP-SSL, GFS approximates the GP posterior distribution 
(Eq.|6l. evaluated on the full dataset (X^ = X) conditioned on a 
growing subset X^uy of inputs. We propose an efficient method 
for updating both GP prior and posterior distributions between 


point-inclusions in section V-A 


Algorithm 1 Greedy Forward Selection 

Require: Training inputs A,y, subset size T, and risk function 
R{X). 

1: r ^ 0 \\ Initial empty subset at iteration t = 0 

2: for r = I to r do 

3: r ^ { r, arg minp^y R {Xf^ur )} \ \ Minimize risk 

4: end for 
5: return r 


Specifying the risk function R is more difficult as its 
evaluation costs must be low. Most risk functions that use 
second-order moments (e.g. GP posterior covariance in Eq. 
are expensive and require approximations to remain tractable 
(361. Evaluating the GP posterior covariance requires O (A^) 
space; its inverse and determinants are expensive to compute 
in sub-cubic time. Instead, we propose a cheaper class of risk 
functions that generalizes only the first-order moments (i.e. 
GP posterior mean in Eq. in section 

A. Incremental GP Models 

A point-update to a GP model can be defined in terms 
of changes to the first/second moments of the GP prior and 
posterior distributions (Eqs. |5l§ and both the Gram matrix 
= K(Xy^Xy) and its inverse generated from inputs 

in Xy. While a point-update to simply contains an 

appended row and column of covariance function evaluations 


V-B 
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Magnitude Pairs Feats., Randomized Training Set 



Magnitude Pairs Feats., Subset-Seiected Training Set 
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Fig. 4: Mercator projectio ns of GP-SSL Koo 
risk function R in section |V-B| are shown. 


predicted mean directions 


evidenced on randomized and subset-selected inputs (prediction error 


[K{Xr^Xf )its direct inverse would be expen¬ 

sive to compute. Instead, we define a recurrence relation with 
its previous inverse as follows. 

Given a sample input-output pair for data index r, 

let indices r = r U f be the union with the subset indices r. At 
iteration t, append a row and column vector along the standard 
basis to the Gram matrix The differences between 
and the appended are two rank-1 updates given by 


diagonals given by 

^11= K^y-U^ Sy = K^yV ^ 

(^*r) — diag diiSiiS^ dySySy ^, 


(14) 


where matrix = K{X^^Xf). The updated log-determinant is 
given by log | = log \K\ —logdudy. The total computational 
costs of updating the GP prior and posterior distributions at 
iteration t are O and O {Nj) operations respectively. 



^(r) 

•^rr ^rr 

K(Xr,X,), 


K(r) 0 

0 1 


T T 
— UU + VV , 


krr = K{Xf,Xf) + a^, 


( 12 ) 



V = 


-et 


)■ 


w = 


, and et is the column of the identity 


matrix. The update in Eq. 1 12| allows to follow from the 

modified Woodbury formulWjn lEil given by 


K-y =K~'^+ duUU^ -dyvv'^, K~^ 


K, 


-1 


(^) 

0 


0 

1 


u = K du = {l—<u^u>) \ 

V = -\-duUif) V, dy = (1+ < v,v >)“^, 


(13) 


which requires only two rank-1 updates. For a fixed set of 
test inputs X^, the updated posterior mean vector remains a 
matrix-vector product and the posterior variances are sums of 


B. GP I? Risk Function Criterions 


We show how several risk functions can be derived from 
the distance between any two GP posterior mean functions 
evaluated at a possibly infinite sized set of test inputs X^. 
Given two GPs fa, ft defined over the subsets of inputs Xa^Xij 
for indices a and the distance between their two GP 
posterior mean functions {fa = K^aKf^Ya and = K^hK^^Y^) 
is analytic under certain GP prior assumptions. For prior mean 
m{x) =0 and the product of identical Matem class covariance 
functions in Eq. [7j the errors evaluated at X^ are given by 

Ll{fa,fb)= £ ifa-fbf 

(15) 


= zl QaaZa “ QabZb + zl QbbZb , 

where vectors Za = Kf^Ya G Zb = Kf^Yb G are com¬ 
puted over training data. Updating the risk function evaluations 
between successive iterations t is efficient as updating fa, fb 
need only rank-1 updates via Eq. 13 The associated matrices 
Qab, Qaa, Qbb in Eq. [T^ are sub-matrices of Qxx and can be 
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pre-computed in O (A^^) operations. Computing Qxx depends 
on the following cases. 

Finite Case: If is finite, then matrices Qaa = 
e Qab = Ka,K,b G and 

Qbb = Hx^ex^^b^K^b ^ are the summation of outer- 

products whose /,/^ entries are products of Matern class 
covariance functions in Eq. [7] 

Infinite Case: If = (— 00 ^ 00 ) is the full (unbounded) input 
domain, then matrices Qaa = J^oo^a^K^adX:^ ^ ]^NaxNa^ _ 
JZKa,K,hdx, e and Qhh = jr^Ki»K,i,dx, G 

contain improper integral entries. For a valid distance measure, 
the posterior mean functions converge to identical zero-mean 
priors at the limits and the improper integrals of 

the form Qa-bj = nf=i Fvijk given by 


[ Fy[\x^^b ^^k\j^ak)Fy(^\Xbih ( 16 ) 

J —00 


^vijk — 


are shown to be finite (see Appendix Eq. |^. Several combi¬ 
nations of the distance are summarized as follows. 

Prediction Error L\ The prediction error is taken 

between the GP posterior means at test inputs X^=X and 
the known sample pairs (^,T). 

Generalized Error (/(^^, f(^x) ) • The generalized error is 
taken between two GP posterior mean functions /(^) and 
evaluated at any finite X^ (may be out-of-sample from X). For 
GFS, the two GPs are specified by subset-selected a = (r) and 
the full set of inputs h = (X). 


Normalized Error L? 


fir) f{X) 


VIIV) II’ll A) I 


The normal¬ 


ized error or ”frequentist“ risk is taken between two normal¬ 


ized GP posterior mean functions ( 


fjo) 


and 


fjb) 


) evaluated 


'II/mII IIA)!. 

at = (— 00 , 00 ) given uniform probability distribution over x*. 
The norm term ||/|| = ^S-oof{^Ydx is shown to be finite by 
setting either of the functions in Eq. [^to zero. The two GPs 
are specified on subset-selected a = (r) and the full set of 
inputs b = {X). 


C. Experiments 

GFS selects for increasing subset sizes until it contains the 
full dataset. At each iteration t, the incremental GP-SSL Koo 
model infers directions (posterior means) along test inputs 
X^=X. The mean angular separation error (Eq. [TT]) between 
the predicted and the reference measurement directions are 
computed and shown in Fig. intercepts with horizontal lines 
indicate subset sizes at 5° and 1° errors. The crossover points 
at the 5° error line (localization accuracy) are achieved for 
MP and AMR features at a small fraction of the total input set 
(approximately 50 and 150 feature-direction pairs); decreases 
in localization error after 50 randomized samples becomes 
logarithmic with diminishing returns. Moreover, GFS selected 
models generalize better than that of randomized selection in 
all but the PD features; a visual (second column plots in Fig. 
1^ shows that the former more accurately localizes directions 
further from the median plane. 

VI. Active-Learner System 

The active-learning process for inferring HRTFs is as fol¬ 
lows. The collection of p number of target directions is 


Subset-Selection Generalization Errors 



Fig. 5: Generalization errors are shown for GP-SSL models evidenced 
on randomized (dotted) and GFS [prediction error (solid), normalized 
error (dashed)] selected subsets of feature-direction pairs. 


specified as u G t/ G For rounds r < T, a query HRTF 

(MP) Xt is chosen from the candidate set X^ and appended 
to form input matrix X G R^^^. The listener localizes x^, 
registers the direction over a GUI (see Fig. [^, and appends 
the directions to form matrix V G The SSLEs w.r.t. U 

are computed in Y^t = SSLE(u,v^) s.t. Y = —U^V G 
Last, the updated feature-direction pairs (X,F) are added into 
the GP-SSLE models via incremental GPs (section |V-A| ). The 
system components are organized below. 



Fig. 6: GUI shows a mercator projection of spherical coordinate 
system onto 2D panel. User clicks on panel to report a direction. 


A. Conditional Mixture of Gaussians Models 

While it is possible to specify an entire HRTF database as 
the candidate set, it is reasonable to assume that most samples 
would not be localized near a target direction u; overt features 
arising from the reflections off the anthropometry may be a 
physical impossibility along all measurement directions. Con¬ 
versely, choosing only HRTFs with measurement directions 
equivalent to u restricts the sample size to the number of 
subjects in the database. To address both issues, we model both 
the HRTFs and their corresponding measurement directions 
using a conditional mixture of Gaussians model (MoG) trained 
from the CIPIC database (see section |VI-A| ). This allows for 
X^ to be drawn from a distribution of HRTFs conditioned at 
any direction u. 
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The MoG models the joint distribution between input 
variables as if the samples are drawn from a latent set of 
normal distributions. The input variables consist of measure¬ 
ment directions u and leading principal components (PCsQ 
w associated with HRTFs along u. The joint distribution is 
modeled by a weighted sum of M normal distributions with 
mean and covariances given by 


w 

u 




Mw 

Mu 


Z = 


y 

^uw 


y 

^wu 


The equivalence allows for the choice of the GP-SSLE 
model’s covariance function to approximated by that of GP- 
SSL models trained over known feature-direction pairs (e.g. 
CIPIC subject data). While these subjects are not identical 
to the listener, the trained GP-SSL models all share simi¬ 
lar covariance functions as their hyperparameters are well- 
distributed (see Fig. |7]); high frequency bands above 17 kHz 
tend to be negligible while lower frequency sub-bands between 
0 — 3 and 4 — 7.5 kHz are relevant. 


M M 

P{z) = (z|;uW,eW) , £7ri = 1, 

i=\ ^ ^ i=\ 


(17) 


Distribution of GP-SSL Hyperparameters 


on u is also a MoG given by 


well-known 

20 

gl5 

111 

I : : 

1 . 

1 > ‘ 

i , . i |S 

ImF ♦ * i = ::l 1 

1(1 ' 

1 

:|S 

conditioned 

TO 

5 

n 

^ ♦ill 


iMil 

WHl 




M KiJP (ulAii 

= I -- / in i/n - 

where the conditional mean and covariance for the 


(18) 




mix- 

_ 


ture are (u - and = Z, 

{dyld respectively. The candidate set is given by 


yVsy 

^wu^u 


18 


PCs randomly sampled from the conditional MoGj^in Eq. 
and decoded into HRTFs to form the candidate set. The non- 
individualized (directional-averaged) HRTFs are approximated 
by the sum of the weighted conditional mixture means. 


Left-ear Hyperparameters 



Fig. 7: Distribution (box-plot) of hyperparameter values are shown for 
GP-SSL models (x-axis 0 — 22.1 kHz frequency range). Large valued 
hyperparameters 4 indicate less sensitivity along the frequency. 


C. Query-Selection 


B. GPs for Modeling SSLE 

GP-SSLE models (/i:p(X) = {/i(X),.. ./^(X)}) are spec¬ 
ified by a common set of input MP features X and output 
SSLEs Y for each of the p number target directions in U. 
Accurate modeling of the SSLE depends on the choice of GP 
prior mean and covariance functions. A zero mean prior is 
reasonable as reported directions v in the absence of local¬ 
ization should average to the zero vector. Choosing the GP 
covariance function is more difficult as the hyperparameters 
cannot be optimized in the absence of observations; inaccurate 
priors would result in poor generalizations error. 

Fortunately, GP-SSLE models can be related to GP-SSL 
models when U is the infinite set of target directions uniformly 
sampled over a unit sphere. Substituting the SSLE labels Y = 
—U^V into Eq. the GP-SSLE LMH is now given by 

L=-^{\U\log\K\+tr{QUU^)+t\U\log{27t)), (19) 

where matrix Q = VK~^V^. As p the sample covariance 
of U approaches a constant variance UU^ = due to 
symmetry. The LMH in Eq. reduces to 

Ls =-iL ^log|^| + ^^^+/log( 2 ;r)^ , ( 20 ) 

which is equivalent to that of GP-SSL models for MP features 
X and directions V. 

^PCs are computed from same-subject, mean-centered, log-magnitude pairs 
(concatenated left and right ear HRTFs). 

^Leading 16 PCs are sampled (via Gibbs sampling) from one of M = 64 
multivariate normal distribution (randomly selected by weight). 


We present GP based query-selection as a modification of 
a known algorithm ED which is derived as follows. Consider 
the observed minimum SSLE for any u at round t given by 

tjut = fRiR(fui 5 • • • 5 Put)- (21) 


Realizations of SSLEs (r =/(x*|X,F)) by the GP-SSLE 
posterior distribution (Eq. at a candidate input G X^ will 
be normally distributed whose mean and variances represent 
the expected SSLE and uncertainty respectively. Thus, im¬ 
provements (lowering) upon the global minimum rjut is given 
by the loss-function Aur( 7 ) =min( 7 , r]ut) whose expectation 
can be computed via marginalizing over the 7 . 

The expected loss-function is analytic for any single u and 
so the weighted expected loss function (specified over each 
ueU with independent GP-SSLE models) is given by 

A(x>|c) = ^ Pu f (7)^^(7|Mu 5 ^u)^7 ~ ^ Pufku: 
uet/ 7-00 

fku = But “L (Mu ~ But)W{But\fluiCu) ~ Cn^yV (tJwIMu^^u)? 

( 22 ) 


where weights Pu = 1 /p can be set to a constant, GP-SSLE 
posterior mean and covariance functions at x^ evidenced with 
(Xi:^ :,Fu 1 :^) are denoted by pu and Cu, and the cumula¬ 
tive normal distribution of Cu is denoted by y/. The query 
HRTE is chosen as the lowest scoring candidate or minimizer 
argmin^^^^c A (x^) of the criterion Eq. 22 which balances 
local improvement through the posterior mean term {pt — rjf) 
with exploring uncertain predictions through the posterior 
variance term Cu- The property is useful for proving the rate 
of convergence |[3^ to the true solution in Eq. 
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D. Experiments 


GP-SSL active-learning trials: One method for fast and re¬ 
peatable empirical validation substitutes the human listener for 
GP-SSL models trained on CIPIC subject data. Localizations 
at can be reported as either the GP posterior mean direc¬ 
tions, or by sampling from the GP posterior distribution. This 
allows for large subsets of to be efficiently evaluated with 
little time costs. For coherence, we limit the query-selection 


criterion in Eq.j^to single target directions u belonging to the 
CIPIC HRTF measurement directions (queries made for past u 
are discarded). GP-SSLE’s covariance hyperparameters are set 
to that of the GP-SSL mean hyperparameters (averaged across 
45 subject models); hyperparameters can be retrained after 
each round but is not necessary for improving the localization 
error. The variance term is set to a = 0.05. 

In tests, the active-learner submits an initial non- 
individualized query HRTF for u and then proceeds through 
r = 50 rounds of query-selection from a candidate HRTF 
set of 20000 samples drawn from a conditional MoG (Eq. 
[T^ . The nearest localized directions are shown to closer 
to their target directions than the non-individualized guesses 
(see Fig. [^. Non-individualized HRTFs are localized closer 
to the horizontal plane and towards the back of the head. 
Nearest localized directions accord with empirical studies of 
difficulties in front-back and up-down confusion with human 
subjects (H. The experiment is repeated across all 45 GP- 
SSL CIPIC subject models (see Fig. [^. The improvement 
can be expressed as the mean ratio between the angular 
separation errors of the initial and nearest localized directions. 
The mean improvement is 7.729 across all CIPIC measurement 
directions, 9.139 for median plane directions, and 8.252 for 
horizontal plane directions. 


Active-Learning: GP-SSL Localizations 



+ nonindv. 

* active-learning 



Azimuth 

Fig. 8: Nearest localized directions after active-learning by the GP- 
SSL model (red) improve upon initial non-individualized HRTF 
localizations (blue). 


Human active-learning trials: For a human listener, we 
develop a simple GUI in Matlab that consists of an azimuth- 
elevation plot that the subject clicks to report v^. To introduce 
contrast in hearing, two test signals are alternatively played 
over headphones until the listener reports a direction. The first 
is a short burst of WGN independently generated for left and 
right ear channels. The second is the WGN convolved with the 
left and right min-phase HRTFs derived from the binaural MP 
features. The trials proceed as the listener localizes queries for 


Active-Learning Localization Errors by GP-SSL Models 



Fig. 9: Mean angular errors are shown for the initial query (non- 
individualized HRTFs) and nearest HRTF queries. 


r = 10 rounds in each of the 14 target directions (7 on the 
horizontal and median planes each). 

For 5 sample human listeners, the initial and nearest (min¬ 
imum) localization errors for each of the target direction 
are shown in Table IV and are compared to synthetic trials 
conducted with the 45 GP-SSL CIPIC subject models. In both 
cases, the largest errors occur along the median plane direction 
6 = { — 1.6,—0.69}. The mean percentage improvements of 
the nearest localizations over that of the non-individualized 
HRTFs are 49% and 43% for human and GP-SSL listeners re¬ 
spectively. GP-SSL localization errors are generally lower and 
more consistent across all direction than the human listeners; 
GP-SSL models can report a posterior mean direction whereas 
human listener exhibit variances in his/her localizations, even 
for identical test signals. It may be of interest in future work 
to both measure and model human localization variances via 
the GP-SSL’s variance term a and by sampling localizations 
from the GP posterior distribution. 


Tab. IV: Active-learner: non-individualized and minimum horizontal 
(j) and median 6 plane localization errors (degrees) 



GP-SSLo 

GP-SSL^n 

Humano 

Hurnaiimin 

(j) : -2.4 

23.1 ±15.8 

12.6±9.01 

42.5 ±35.6 

16.4±7.43 

(}>:-1.6 

19.9±12.1 

10.4±7.49 

34 ±14.4 

5.98±7.17 

: -0.79 

24.6 ±16.7 

7.45 ±4.88 

56.7±17.5 

28.8 ±14 

: 0.79 

22 ±16.2 

7.87±5.12 

48.7±18 

21.5±13.6 

(f) : 1.6 

15.8±9.38 

6.63 ±3.68 

23.7 ±10.6 

10.8±5.23 

:2.4 

22.7 ±14.7 

13.2±7.06 

31.2±11.6 

14.9 ±5.26 

6:-1.6 

55.6 ±26 

37.1 ±20.8 

119±43.3 

59.8 ±29.5 

e : -0.79 

105 ±44.9 

37.9 ±20.9 

104 ±37.3 

61.8±22.4 

0:0 

44.1 ±44 

11.6 ±9.75 

39.2±22.1 

23.3 ±9.82 

e : 0.79 

35.9±23.2 

15.8±11.1 

24.7 ±12.3 

15.3 ±4.76 

0: 1.6 

31.9±18.4 

15.6±9.5 

55±23.1 

30.2 ±25.9 

0:2.4 

17.2±14.8 

10.8±7.38 

83.6 ±56 

24.3 ±23.9 

0:3.1 

24.5 ±19.6 

12.6±6.88 

92.7±68.1 

11.9±8.72 

0:3.9 

26.1 ±17.1 

8 ±5.67 

61.5±42.7 

18.6±11.1 


VH. Conclusions 

We developed a robust method for the SSL using sound- 
source invariant features derived from left and right ear HRTF 
measurements. Our GP-SSL models generalized NN based 
approaches and were shown to more accurate in both cases of 
randomized and sub set-selected features; good spatialization 
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accuracy (5°) over the full sphere was possible using a fraction 
of the available features. For learning HRTFs in listening 
tests, we developed an active-learning method for query- 
selection using GP models. Both simulations with offline GP- 
SSL models and HRTFs recommended to real human listeners 
have shown large improvement in localization accuracy over 
non-individualized HRTFs. 


Appendix A 

Matern Product Integrals 


Improper integrals in Eq. 16 have closed-formulations: 




^\ijk 


^ijk ~ 


lake ‘ak hk 


'^■lak^hk 
k-ak ^ *^bk 




Wak - Phk - cc)e ^ak 






\ 


+ ^bki^bk + P^ak ~ 


4Lkl 


ak^bk 


a 


= -V3 


^ttik ^bjk 


, P = 


i^ak ^ik) 

^^ak^bk 


2 ’ 


y 


^ak ^bk 


mjk 


= e K‘lk+ik) ^“k^bkV^ 

\J^lk + ^lk 


(23) 


References 

[1] J. Blauert, Spatial hearing: the psychophysics of human sound localiza¬ 
tion. Cambridge, Massachusettes: MIT Press, 1997. 

[2] C. Cheng and G. Wakefield, “Introduction to head-related transfer 
functions (HRTFs): Representations of HRTFs in time, frequency, and 
space,” in Audio Engineering Society Convention 107, 1999. 

[3] A. Kulkarni and H. Colburn, “Role of spectral detail in sound-source 
localization,” Nature, vol. 396, no. 6713, pp. 747-749, 1998. 

[4] E. Wenzel, M. Arruda, D. Kistler, and F. Wightman, “Localization using 
nonindividualized head-related transfer functions,” JASA, vol. 94, p. Ill, 
1993. 

[5] G. Romigh, D. Brungart, R. Stern, and B. Simpson, “The role of spatial 
detail in sound-source localization: Impact on HRTF modeling and 
personalization.” in Proceedings of Meetings on Acoustics, vol. 19, 2013. 

[6] J. Hornstein, M. Lopes, J. Santos-Victor, and F. Lacerda, “Sound lo¬ 
calization for humanoid robots-building audio-motor maps based on the 
HRTF,” in Intelligent Robots and Systems, 2006 lEEE/RSJ International 
Conference on. IEEE, 2006, pp. 1170-1176. 

[7] M. Rothbucher, D. Kronmiiller, M. Durkovic, T. Habigt, and K. Diepold, 
“HRTF sound localization,” 2011. 

[8] T. Rodemann, M. Heckmann, F. Joublin, C. Goerick, and B. Scholling, 
“Real-time sound localization with a binaural head-system using a 
biologically-inspired cue-triple mapping,” in International Conference 
on Intelligent Robots and Systems. IEEE, 2006, pp. 860-865. 

[9] H. Nakashima and T. Mukai, “3D sound source localization system 
based on learning of binaural hearing,” in Systems, Man and Cybernetics, 
2005 IEEE International Conference on, vol. 4. IEEE, 2005. 

[10] V. Willert, J. Eggert, J. Adamy, R. Stahl, and E. Korner, “A probabilistic 
model for binaural sound localization,” IEEE Transactions on Systems, 
Man, and Cybernetics-Part B: Cybernetics, vol. 36, no. 5, p. 1, 2006. 

[11] A. Deleforge and R. Horaud, “2D sound-source localization on the 
binaural manifold,” in Machine Learning for Signal Processing (MLSP), 
2012 IEEE International Workshop on. IEEE, 2012, pp. 1-6. 


[12] F. Keyrouz, K. Diepold, and S. Keyrouz, “High performance 3D sound 
localization for surveillance applications,” in Advanced Video and Signal 
Based Surveillance, 2007. AVSS 2007. IEEE Conference on. IEEE, 
2007, pp. 563-566. 

[13] F. Keyrouz, “Humanoid hearing: A novel three-dimensional approach,” 
in Robotic and Sensors Environments (ROSE), 2011 IEEE International 
Symposium on. IEEE, 2011, pp. 214-219. 

[14] F. Keyrouz and K. Diepold, “An enhanced binaural 3D sound localiza¬ 
tion algorithm,” in Signal Processing and Information Technology, 2006 
IEEE International Symposium on. IEEE, 2006, pp. 662-665. 

[15] A. Pourmohammad and S. Ahadi, “TDE-ILD-HRTF-Based 3D entire- 
space sound source localization using only three microphones and source 
counting,” in Electrical Engineering and Informatics (ICEEI), 2011 
International Conference on. IEEE, 2011, pp. 1-6. 

[16] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine 
Learning. Cambridge, Massachusettes: MIT Press, 2006. 

[17] A. J. Smola and B. Scholkopf, “A tutorial on support vector regression,” 
Statistics and computing, vol. 14, no. 3, pp. 199-222, 2004. 

[18] Y. Luo, D. N. Zotkin, and R. Duraiswami, “Statistical analysis of head 
related transfer function (HRTF) data,” in International Congress on 
Acoustics, 2013. 

[19] -, “Gaussian process data fusion for heterogeneous HRTF datasets,” 

in WASPAA, 2013. 

[20] -, “Gaussian process models for HRTF based 3D sound localization,” 

in ICASSP, 2014. 

[21] V. R. Algazi, R. O. Duda, and C. Avendano, “The CIPIC HRTF 
Database,” in IEEE Workshop on Applications of Signal Processing to 
Audio and Acoustics, New Paltz, NY, 2001, pp. 99-102. 

[22] 1. Guyon and A. Elisseeff, “An introduction to variable and feature 
selection,” Journal of Machine Learning Research, vol. 3, pp. 1157- 
1182, 2003. 

[23] D. Zotkin, J. Hwang, R. Duraiswaini, and L. S. Davis, “HRTF personal¬ 
ization using anthropometric measurements,” in Applications of Signal 
Processing to Audio and Acoustics, 2003 IEEE Workshop on. leee, 
2003, pp. 157-160. 

[24] H. Hu, L. Zhou, H. Ma, and Z. Wu, “HRTF personalization based on 
artificial neural network in individual virtual auditory space,” Applied 
Acoustics, vol. 69, no. 2, pp. 163-172, 2008. 

[25] Q. Huang and Y. Fang, “Modeling personalized head-related impulse 
response using support vector regression,” J Shanghai Univ (Engl Ed), 
vol. 13, no. 6, pp. 428-432, 2009. 

[26] K. Fink and L. Ray, “Tuning principal component weights to individu¬ 
alize HRTFs,” in ICASSP, 2012. 

[27] A. Silzle, “Selection and tuning of HRTFs,” in Audio Engineering 
Society Convention 112. Audio Engineering Society, 2002. 

[28] P. Runkle, A. Yendiki, and G. Wakefield, “Active sensory tuning for 
immersive spatialized audio,” in Proc. ICAD, 2000. 

[29] Y. Luo, D. N. Zotkin, and R. Duraiswami, “Virtual autoencoder based 
recommendation system for individualizing head-related transfer func¬ 
tions,” in WASPAA, 2013. 

[30] B. Settles, “Active learning literature survey,” University of Wisconsin, 
Madison, 2010. 

[31] M. Osborne, R. Garnett, and S. Roberts, “Gaussian processes for 
global optimization,” in 3rd International Conference on Learning and 
Intelligent Optimization (LION3), 2009, pp. 1-15. 

[32] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process 
optimization in the bandit setting: No regret and experimental design,” 
arXiv preprint arXiv:0912.3995, 2009. 

[33] R. Woodworth and G. Schlosberg, Experimental psychology. Holt, 
Rinehard and Winston, 1962. 

[34] C. Williams and M. Seeger, “Using the Nystrom method to speed 
up kernel machines,” in Advances in Neural Information Processing 
Systems, 2000. 

[35] S. Mika, B. Scholkopf, A. J. Smola, K.-R. Muller, M. Scholz, and 
G. Ratsch, “Kernel PC A and de-noising in feature spaces.” in NIPS, 
vol. 11, 1998, pp. 536-542. 

[36] M. Seeger, C. Williams, and N. Lawrence, “Fast forward selection to 
speed up sparse gaussian process regression,” in Artificial Intelligence 
and Statistics 9, no. EPFL-CONF-161318, 2003. 

[37] R. Saigal, “On the inverse of a matrix with several rank one updates,” 
University of Michigan Ann Arbor, Tech. Rep., 1993. 


















